feat: streaming SSR pipeline #294
Adds implementation-neutral benchmark infrastructure for measuring WebUI
SSR performance. This commit can be cherry-picked onto origin/main
to capture a baseline; subsequent commits in this branch then add the
streaming primitive (commit 2) and the signal-based injection +
hot-path perf hardening (commit 3), each with deltas measurable against
the numbers captured at this commit.
What this commit adds:
- crates/webui/benches/streaming_bench.rs (criterion native): writer-
path wall-clock at three contact-book scales (10/100/1000) for two
paths that exist on origin/main:
* `string` - pre-allocated String buffer baseline.
* `string+postinject` - String + case-insensitive </body> byte-
window scan + concat. Mirrors the legacy dev-mode livereload
pipeline (`lr.inject(&buf)`).
- crates/webui/examples/streaming_resource_bench.rs (custom
GlobalAlloc + getrusage): per-render allocation count, total bytes,
user CPU microseconds, peak RSS for the same two paths.
Snapshot save/load via --save NAME / --compare NAME.
- xtask/src/main.rs:
* `cargo xtask bench streaming` runs the criterion writer-path
bench. `cargo xtask bench streaming-resource` runs the custom
allocator bench. `cargo xtask bench full` runs both.
* --save-baseline NAME / --baseline NAME flags map to criterion's
native flags for the criterion bench, and to --save/--compare
for the resource bench. Both store JSON/criterion snapshots
under target/bench-baselines/ (or target/criterion/).
- BENCHMARKS.md: top-level documentation describing the bench layers,
the threshold guidance for noise vs signal, and the before/after
workflow.
- crates/webui-parser/Cargo.toml: cargo-shear metadata exempting
`clap` (used only via cfg_attr-gated derive macro that cargo-shear
cannot expand).
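The allocator-level measurement described above can be sketched with the standard library alone. This is a minimal illustration, not the code in streaming_resource_bench.rs: the names `CountingAlloc` and `measure` are hypothetical, and the getrusage half (user CPU, peak RSS) is left out because it needs a platform binding.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts every allocation request.
/// (Hypothetical name; the real bench may be structured differently.)
struct CountingAlloc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns (allocation count, bytes requested) seen during the call.
/// Other threads allocating at the same time would inflate the numbers.
fn measure<F: FnOnce()>(f: F) -> (usize, usize) {
    let allocs_before = ALLOCS.load(Ordering::Relaxed);
    let bytes_before = BYTES.load(Ordering::Relaxed);
    f();
    (
        ALLOCS.load(Ordering::Relaxed) - allocs_before,
        BYTES.load(Ordering::Relaxed) - bytes_before,
    )
}

fn main() {
    let (allocs, bytes) = measure(|| {
        // Stand-in for one render into a pre-allocated String buffer.
        let buf = String::with_capacity(1024);
        std::hint::black_box(&buf);
    });
    println!("{allocs} allocs, {bytes} bytes");
}
```

Dividing the totals by the iteration count gives per-render figures of the kind reported below.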
Subsequent commits will:
- Add the StreamingWriter / ChunkPool primitive plus the
`streaming` / `streaming POOLED` rows to both benches, the actix-
based streaming-e2e-ttfb bench, and the Playwright streaming-browser
bench (commit 2).
- Add the signal-based RenderOptions::with_head_inject /
with_body_inject API plus the `streaming+inject(opts)` / `streaming+
inject(opts) POOLED` rows, the per-render hot-path perf hardening,
and CLI / commerce wiring (commit 3).
Reproduction workflow:
    # On any commit:
    cargo xtask bench streaming-resource --save-baseline before
    cargo xtask bench streaming --save-baseline before
    # Apply the change you want to measure...
    cargo xtask bench streaming-resource --baseline before
    cargo xtask bench streaming --baseline before
Numbers from this commit on the contact-book-manager protocol at
scale 1000 (release build, 2000 iters/path):
    string/1000:            525 allocs, 51.7 KiB, 23.49 us user CPU
    string+postinject/1000: 526 allocs, 75.0 KiB, 33.65 us user CPU
The post-inject overhead at this commit (about +10 us user CPU, +23 KiB
output) is the cost any host pays for per-request HTML splicing without
a structured injection API, and it is the cost the injection API in
commit 3 eliminates.
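For orientation, the kind of work the `string+postinject` row pays for can be sketched as follows. This is an assumption-laden illustration, not the legacy `lr.inject` code: the function name `post_inject`, the choice of the last match, and the append fallback are all hypothetical.

```rust
/// Finds the last case-insensitive `</body>` in `html` and splices `inject`
/// in front of it. Always allocates a second buffer (the "concat" cost);
/// appends the inject when no closing tag is found.
fn post_inject(html: &str, inject: &str) -> String {
    const TAG: &[u8] = b"</body>";
    // Byte-window scan from the end of the document.
    let pos = html
        .as_bytes()
        .windows(TAG.len())
        .rposition(|w| w.eq_ignore_ascii_case(TAG));

    let mut out = String::with_capacity(html.len() + inject.len());
    match pos {
        Some(i) => {
            // `i` points at the ASCII '<', so it is a valid char boundary.
            out.push_str(&html[..i]);
            out.push_str(inject);
            out.push_str(&html[i..]);
        }
        None => {
            out.push_str(html);
            out.push_str(inject);
        }
    }
    out
}

fn main() {
    let page = "<html><BODY>hi</BODY></html>";
    println!("{}", post_inject(page, "<script></script>"));

    // A `</body>` literal inside a trailing comment wins the scan,
    // so the inject lands inside the comment.
    let spoofed = "<body>x</body><!-- </body> -->";
    println!("{}", post_inject(spoofed, "X"));
}
```

The second call shows why a byte scanner is fragile: it cannot tell a structural closing tag from the same bytes inside a comment.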
Quality: cargo xtask check passes (1096s, all phases).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…layers
Adds the streaming SSR primitive (StreamingWriter, ChunkPool) and
extends the bench infrastructure from the previous commit with three
new measurement layers. No handler-level rendering semantics change at
this commit — the signal-based injection API and per-render hot-path
perf hardening land in the next commit.
What this commit adds:
- crates/webui/src/streaming.rs (~820 lines):
* StreamingWriter: bounded tokio mpsc-backed ResponseWriter with
coalesced ~4 KB chunks, configurable flush deadline (slow-loris
DoS bound), typed disconnect/timeout errors. Documented usage
pattern is `actix_web::rt::task::spawn_blocking`.
* ChunkPool: lock-free shared pool of Vec<u8> chunk buffers
backed by crossbeam_queue::ArrayQueue. Buffers recycle via
Bytes::from_owner + a custom owner type that returns the Vec
on Bytes drop. Cross-thread drop safety verified by test.
* 13 unit tests covering coalescing, disconnect, timeout, chunk-
size override, pool round-trip, dirty-buffer handling, capacity
enforcement, single-Bytes drop, ref-counted clone drop,
recycling across renders, cross-thread drop.
- crates/webui-handler/src/lib.rs:
* HandlerError gains two variants (ClientDisconnected,
StreamTimeout) so streaming writers can return typed errors.
Both variants are payload-free (allocation-free) so error paths
stay cheap.
- crates/webui/Cargo.toml + workspace Cargo.toml: adds tokio, bytes,
crossbeam-queue, memchr, tokio-stream, actix-web, awc, futures-util
as dependencies of the streaming primitive and the new benches.
- crates/webui/benches/streaming_bench.rs: extended with a
`streaming` row (alongside the existing `string` and
`string+postinject` rows from the previous commit) plus a `ttfb`
group measuring time-to-first-chunk for streaming vs buffered.
- crates/webui/examples/streaming_resource_bench.rs: extended with
`streaming` and `streaming POOLED` rows for the same allocator-
level + getrusage measurements as the baseline rows.
- crates/webui/examples/streaming_e2e_ttfb_bench.rs (NEW): in-process
actix-web server measuring real HTTP TTFB / TTLB for `/buf` vs
`/stream` under configurable per-write delays. JSON snapshot
baseline support (--save NAME / --compare NAME).
- examples/integration/streaming-browser-bench/ (NEW): standalone
Playwright suite + small hand-built actix-web server. Measures
browser-perceived metrics (TTFB / FCP / LCP / DCL / load) in real
Chromium across four render scenarios (no-delay, 25 ms, 100 ms,
250 ms render times). The server is intentionally hand-built so
it isolates the streaming-vs-buffered question without confounding
from WebUI handler details. Baseline support via WEBUI_BENCH_SAVE
/ WEBUI_BENCH_COMPARE env vars.
- xtask/src/main.rs:
* `cargo xtask bench streaming-e2e-ttfb` and
`cargo xtask bench streaming-browser` targets added.
* `cargo xtask bench full` (= `streaming-all`) now runs the
criterion writer-paths + resource bench + e2e-ttfb + browser
bench in sequence, threading the same baseline name through
every layer.
* --save-baseline / --baseline flags map to criterion's native
flags for criterion benches, --save / --compare for the
example benches, and WEBUI_BENCH_SAVE / WEBUI_BENCH_COMPARE
env vars for the Playwright bench.
- xtask/src/e2e.rs: wires the streaming-browser-bench Playwright
suite into `cargo xtask e2e` so it runs in CI alongside the
other example apps.
- BENCHMARKS.md / crates/webui/benches/README.md: updated to
describe the new bench layers and what each one measures.
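The coalescing and back-pressure behaviour described for StreamingWriter can be illustrated with a standard-library stand-in. This sketch swaps tokio's mpsc for `std::sync::mpsc::sync_channel`, omits the flush deadline and the pool, and uses hypothetical names (`CoalescingWriter`, `render_rows`); it is not the code in streaming.rs.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

/// Returned when the receiving side has gone away.
#[derive(Debug, PartialEq)]
struct ClientDisconnected;

/// Coalesces small writes into chunks of at least `chunk_size` bytes and
/// sends them over a bounded channel, blocking while the channel is full.
struct CoalescingWriter {
    tx: SyncSender<Vec<u8>>,
    buf: Vec<u8>,
    chunk_size: usize,
}

impl CoalescingWriter {
    fn new(chunk_size: usize, capacity: usize) -> (Self, Receiver<Vec<u8>>) {
        let (tx, rx) = sync_channel(capacity);
        let writer = Self { tx, buf: Vec::with_capacity(chunk_size), chunk_size };
        (writer, rx)
    }

    fn write(&mut self, data: &str) -> Result<(), ClientDisconnected> {
        self.buf.extend_from_slice(data.as_bytes());
        if self.buf.len() >= self.chunk_size {
            self.flush()
        } else {
            Ok(())
        }
    }

    fn flush(&mut self) -> Result<(), ClientDisconnected> {
        if self.buf.is_empty() {
            return Ok(());
        }
        let chunk = std::mem::replace(&mut self.buf, Vec::with_capacity(self.chunk_size));
        // `send` blocks when `capacity` chunks are already queued (back-pressure)
        // and fails once the receiver has been dropped.
        self.tx.send(chunk).map_err(|_| ClientDisconnected)
    }
}

/// Renders `n` fixed-width rows on a worker thread and returns the
/// sizes of the chunks the consumer received.
fn render_rows(n: usize, chunk_size: usize) -> Vec<usize> {
    let (mut writer, rx) = CoalescingWriter::new(chunk_size, 4);
    let producer = thread::spawn(move || {
        for i in 0..n {
            writer.write(&format!("<li>{i:04}</li>")).unwrap();
        }
        writer.flush().unwrap();
    });
    let sizes: Vec<usize> = rx.iter().map(|chunk| chunk.len()).collect();
    producer.join().unwrap();
    sizes
}

fn main() {
    println!("{:?}", render_rows(1000, 4096));
}
```

The consumer sees a handful of roughly 4 KB chunks rather than one message per row, and the producer can never run more than four chunks ahead of it.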
Reproduction workflow:
    # On the previous commit (baseline-only):
    cargo xtask bench full --save-baseline before
    # On this commit (adds streaming):
    cargo xtask bench full --baseline before
    # Browser-perceived metrics (real Chromium):
    cargo xtask bench streaming-browser --save-baseline before
    # …on a later commit…
    cargo xtask bench streaming-browser --baseline before
Quality: cargo xtask check passes (1165s, all phases). All 13
streaming module tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed: 69a65a5 to 7e988db
akroshg left a comment:
Inline notes on a few things worth a look. The architecture (structural-signal injection, lock-free ChunkPool, Bytes::from_owner ownership, bounded-channel back-pressure) is the right design and the perf claims hold up — these are 4 edge-case bugs worth fixing on top.
Force-pushed: 7e988db to bdd5fee
Thanks @akroshg, all four findings were valid, fixed in …
✅ Your 4 findings: all fixed
Adversarial re-audit findings
Ran a fresh review specifically looking for patterns matching your four (release-vs-debug gaps, swallowed …
- Bug-6 from the re-audit (nested …
- Bug-8 from the re-audit (cumulative …
- Bug-9 from the re-audit (…
Net change
Force-pushed as commit …
The two re-audit findings I noted as low-impact but worth fixing (Bug-8 / Bug-9 in the bench) confirm that even after an external review, fresh adversarial eyes find more. That's a good signal, so please do another pass if you have time. The streaming and handler hot paths are the right places to keep poking at.
Force-pushed: bdd5fee to 6e3af60
Fixed in …
Root cause
Pre-existing race in the test, widened by the streaming SSR pipeline this PR introduces: not caused by this PR, but exposed by it. The test sequence:
await page.locator('mp-category-nav').getByRole('link', { name: 'Shirts' }).first().click();
await expect(page).toHaveURL('/search/shirts'); // ← passes
await page.locator('mp-filter-list').getByRole('link', { name: 'Price: ...' }).click(); // ← race
A click in that window hits the stale link and lands on …
Local 10× repro on my machine: passes 10/10. Origin/main also passes 10/10 with the same test code. The race window is below the schedulable threshold on a fast Mac M-series; on the GitHub Actions Linux runner with Playwright workers contending for CPU, it lands inside the click. The streaming SSR work in this PR doesn't directly affect the partial-nav code path (partials still go through …
Fix
Made the test deterministic by waiting for the filter-list href to actually update to the new category before clicking:
await expect(page).toHaveURL('/search/shirts');
// Count-based wait (not visibility) — mp-filter-list emits both
// desktop and mobile-only variants, only one is `display`-ed in
// each project, but both share the updated href once the DOM
// patch lands.
await expect(
page.locator('mp-filter-list a[href*="/search/shirts?sort=price-desc"]'),
).not.toHaveCount(0);
await page.locator('mp-filter-list').getByRole('link', { name: 'Price: ...' }).click();
Verified locally with the rebuilt branch binary: 10/10 passes. The remaining 9 e2e failures are the pre-existing macOS↔Linux screenshot baseline mismatches (4 commerce + 5 contact-book …
…path
Builds on the streaming primitive from the previous commit to add the
per-render HTML injection API (`RenderOptions::with_head_inject` /
`with_body_inject`), six allocation-reducing changes on the handler hot
path, five streaming/pool-side improvements, two security guards, and
the wiring for the dev CLI and the commerce example.
Replaces the legacy buffer-then-byte-scan-and-concat injection
pipeline with a structural, signal-driven mechanism. The parser
already synthesises head_end / body_end signal fragments at the
structural boundaries (crates/webui-parser/src/lib.rs:1189-1230),
so the handler simply emits the inject HTML at the existing hook
sites. No byte scanner. No second pass. Per-render injection is a
single writer.write(html) call at the parser-anchored signal:
zero scan cost, and the signal cannot be spoofed by </head> /
</body> literals appearing in HTML comments, <iframe srcdoc>, or
inline <script>.
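The difference from byte scanning can be shown with a toy fragment stream. This is a sketch under stated assumptions: the `Fragment` enum, the `Options` struct, and `render` are hypothetical simplifications, not the webui-handler types, and the exact position of the signals relative to the closing tags is assumed.

```rust
/// Hypothetical fragment stream: raw text plus parser-synthesised
/// boundary signals.
enum Fragment<'a> {
    Raw(&'a str),
    HeadEnd,
    BodyEnd,
}

/// Hypothetical stand-in for the inject fields on RenderOptions.
struct Options<'a> {
    head_inject: Option<&'a str>,
    body_inject: Option<&'a str>,
}

/// Writes fragments in order; inject HTML is emitted only at the signals,
/// never by inspecting the bytes of a raw fragment.
fn render(fragments: &[Fragment<'_>], opts: &Options<'_>, out: &mut String) {
    for fragment in fragments {
        match fragment {
            Fragment::Raw(text) => out.push_str(text),
            Fragment::HeadEnd => {
                if let Some(html) = opts.head_inject {
                    out.push_str(html);
                }
            }
            Fragment::BodyEnd => {
                if let Some(html) = opts.body_inject {
                    out.push_str(html);
                }
            }
        }
    }
}

fn main() {
    let fragments = [
        Fragment::Raw("<html><head><title>t</title>"),
        Fragment::HeadEnd,
        // The `</body>` literal in the comment is plain text to the writer.
        Fragment::Raw("</head><body><!-- </body> --><p>hi</p>"),
        Fragment::BodyEnd,
        Fragment::Raw("</body></html>"),
    ];
    let opts = Options {
        head_inject: Some("<link>"),
        body_inject: Some("<script></script>"),
    };
    let mut out = String::new();
    render(&fragments, &opts, &mut out);
    println!("{out}");
}
```

Because the writer never searches the output, the comment containing `</body>` passes through untouched and each inject lands exactly once at its signal.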
## Performance vs the previous-commit baseline (commit 2)
(per-render, 2000 iters, contact-book at 1000 contacts, custom
GlobalAlloc + getrusage)
metric                              | previous commit | this commit | delta
------------------------------------|-----------------|-------------|----------
string/1000 allocs                  | 525             | 514         | -2.1%
streaming/1000 allocs               | 538             | 527         | -2.0%
string+postinject/1000 allocs       | 526             | 515         | -2.1%
streaming+inject(opts) POOLED bytes | n/a (new path)  | 30.3 KiB    | n/a
user CPU (any path)                 | ~25-30 us       | ~21-23 us   | -10..-30%
Cumulative wins of the new POOLED path vs origin/main legacy
`string+postinject`:
metric       | origin/main | this commit POOLED | delta
-------------|-------------|--------------------|----------
allocations  | 526         | 520                | -1.1%
bytes/render | 75.0 KiB    | 30.3 KiB           | -59.6%
user CPU us  | ~29.7       | ~21.1              | -28.9%
TTFB         | full buffer | first signal       | streaming
## What changed at the handler layer
- crates/webui-handler/src/lib.rs:
* RenderOptions gains `head_inject: Option<&'a str>` /
`body_inject: Option<&'a str>` fields and matching builders
`with_head_inject` / `with_body_inject`. Empty strings normalise
to None for consistency with `with_nonce`.
* `process_signal` emits the inject HTML at the existing
head_end/body_end hook sites, after the built-in nonce meta /
CSS preload links / hydration script. Each emission guarded by
a `head_end_emitted` / `body_end_emitted` flag on
WebUIProcessContext so a malformed protocol cannot multiply the
inject by N (DoS amplification guard).
* Six allocation-reducing changes on the per-render hot path:
1. request_path: String -> &'a str (-1 alloc/render)
2. entry_id: String -> &'a str (-1 alloc/render)
3. nonce: Option<String> -> Option<&'a str> (-1 alloc)
4. route_base: String -> Cow<'a, str> (-1 alloc, "/" zero-copy)
5. <for> loop variable: insert key once, get_mut-swap value
in-place instead of clone-per-iteration. Saves 2*(N-1)
String clones for any N-iteration loop. A 1000-item <for>
saves 1998 allocations.
6. Lazy component_index_cache on the per-render context.
build_component_index() was rebuilt twice per render
(head_end + body_end), each walking the protocol. Now
built on first demand and reused.
- crates/webui/src/streaming.rs (5 hardening changes):
1. Inject fields stored as Option<&'a str> everywhere (no per-
render String::from clone).
2. Fast-path send_with_optional_timeout when no timeout: skips
Handle::try_current() (~10 ns TLS lookup) on every flush.
3. Move chunk-buffer clear from acquire to release. ChunkPool
now clears Vec on release (cheap, just len = 0); acquire
trusts the invariant. One fewer branch on every chunk acquire.
4. with_nonce("") normalises to None, matching the inject API.
5. debug_assert! in the unreachable timeout-without-runtime
branch instead of silent fallthrough.
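Item 5 of the hot-path list (the `<for>` loop variable) is the least obvious of these changes, so here is one possible reading of it as a sketch. The scope is modelled as a `HashMap<String, String>`, and `render_clone` / `render_swap` are hypothetical names; the real handler state and value types are not shown in the commit message and may differ.

```rust
use std::collections::HashMap;

/// Baseline: every iteration allocates a fresh key String and clones the value.
fn render_clone(items: &[String], scope: &mut HashMap<String, String>, out: &mut String) {
    for item in items {
        scope.insert("item".to_string(), item.clone());
        out.push_str(&scope["item"]); // stand-in for rendering the loop body
    }
    scope.remove("item");
}

/// Insert the key once, then move each value into the existing slot with
/// get_mut + swap, and move it back out after the body has rendered.
fn render_swap(items: &mut [String], scope: &mut HashMap<String, String>, out: &mut String) {
    if items.is_empty() {
        return;
    }
    // The only key allocation; String::new() itself does not allocate.
    scope.insert("item".to_string(), String::new());
    for item in items.iter_mut() {
        let slot = scope.get_mut("item").expect("inserted above");
        std::mem::swap(slot, item); // value moves in, no clone
        out.push_str(&scope["item"]); // stand-in for rendering the loop body
        let slot = scope.get_mut("item").expect("inserted above");
        std::mem::swap(slot, item); // value moves back to the list
    }
    scope.remove("item");
}

fn main() {
    let mut items: Vec<String> = ["a", "b", "c"].iter().map(|s| s.to_string()).collect();
    let mut scope = HashMap::new();

    let mut cloned = String::new();
    render_clone(&items, &mut scope, &mut cloned);

    let mut swapped = String::new();
    render_swap(&mut items, &mut scope, &mut swapped);

    println!("{cloned} {swapped}");
}
```

Both functions produce the same output; the second does so with a single key allocation regardless of the number of iterations.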
## Two security guards (DoS-class)
1. Dedupe head_end / body_end emission. Without this, a malformed
protocol that emits the signal N times would multiply the host's
inject by N: a 1 MiB inject x 1000 duplicate signals would have
produced 1 GiB of output. Now emits exactly once per render. Test
`injects_dedupe_against_duplicate_signals` pins the guard.
2. Explicit XSS warning on with_head_inject / with_body_inject doc
comments. Handler writes HTML verbatim - no escaping. The trust
contract is now unmissable.
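The dedupe guard amounts to a once-per-render flag per signal. A minimal sketch follows; the flag names come from the commit message, while `RenderContext`, `Signal`, the `process_signal` signature, and `render_with_duplicates` are hypothetical simplifications of the handler.

```rust
/// Per-render state; the two flag names follow the commit message.
#[derive(Default)]
struct RenderContext {
    head_end_emitted: bool,
    body_end_emitted: bool,
}

#[derive(Clone, Copy)]
enum Signal {
    HeadEnd,
    BodyEnd,
}

/// Emits the matching inject at most once per render, however many times
/// a (possibly malformed) protocol repeats the signal.
fn process_signal(
    ctx: &mut RenderContext,
    signal: Signal,
    head_inject: Option<&str>,
    body_inject: Option<&str>,
    out: &mut String,
) {
    let (emitted, inject) = match signal {
        Signal::HeadEnd => (&mut ctx.head_end_emitted, head_inject),
        Signal::BodyEnd => (&mut ctx.body_end_emitted, body_inject),
    };
    if *emitted {
        return; // duplicate signal: nothing is written
    }
    *emitted = true;
    if let Some(html) = inject {
        out.push_str(html); // written verbatim, no escaping
    }
}

/// Simulates a malformed protocol that fires body_end `n` times.
fn render_with_duplicates(n: usize, body_inject: &str) -> String {
    let mut ctx = RenderContext::default();
    let mut out = String::new();
    for _ in 0..n {
        process_signal(&mut ctx, Signal::BodyEnd, None, Some(body_inject), &mut out);
    }
    out
}

fn main() {
    let out = render_with_duplicates(1000, "<script>lr()</script>");
    println!("{} bytes written", out.len());
}
```

Output size stays bounded by the size of one inject, so a hostile or buggy protocol cannot turn the host's inject into an amplification vector.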
## Production wiring
- crates/webui-cli/src/commands/serve.rs: dev server uses
StreamingWriter::new_pooled with a startup-built ChunkPool (256
slots * ~5 KiB = 1.25 MiB peak), 30 s flush deadline (slow-loris
DoS bound), and feeds the livereload script as Arc<str> via
RenderOptions::with_body_inject.
- examples/app/commerce/server: same pattern; per-page image preload
link tags via RenderOptions::with_head_inject.
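The pool sizing quoted above (256 slots of about 5 KiB each) and the clear-on-release invariant can be sketched as follows. A `Mutex<Vec<..>>` stands in for the lock-free `crossbeam_queue::ArrayQueue`, the `Bytes::from_owner` recycling path is omitted, and the method names and `peak_bytes` helper are assumptions rather than the actual ChunkPool API.

```rust
use std::sync::{Arc, Mutex};

/// Bounded pool of reusable chunk buffers (Mutex-backed stand-in).
struct ChunkPool {
    slots: Mutex<Vec<Vec<u8>>>,
    max_slots: usize,
    chunk_capacity: usize,
}

impl ChunkPool {
    fn new(max_slots: usize, chunk_capacity: usize) -> Arc<Self> {
        Arc::new(Self {
            slots: Mutex::new(Vec::with_capacity(max_slots)),
            max_slots,
            chunk_capacity,
        })
    }

    /// Pops a recycled buffer, or allocates a fresh one when the pool is empty.
    /// Trusts the invariant that pooled buffers are already cleared.
    fn acquire(&self) -> Vec<u8> {
        let recycled = self.slots.lock().unwrap().pop();
        recycled.unwrap_or_else(|| Vec::with_capacity(self.chunk_capacity))
    }

    /// Clears the buffer (len = 0, capacity kept) and returns it to the pool;
    /// the buffer is simply dropped when the pool is already full.
    fn release(&self, mut buf: Vec<u8>) {
        buf.clear();
        let mut slots = self.slots.lock().unwrap();
        if slots.len() < self.max_slots {
            slots.push(buf);
        }
    }

    fn pooled(&self) -> usize {
        self.slots.lock().unwrap().len()
    }
}

/// Worst-case bytes held by a full pool.
fn peak_bytes(max_slots: usize, chunk_capacity: usize) -> usize {
    max_slots * chunk_capacity
}

fn main() {
    let pool = ChunkPool::new(256, 5 * 1024);
    let mut chunk = pool.acquire();
    chunk.extend_from_slice(b"<p>hello</p>");
    pool.release(chunk);
    println!("pooled: {}, peak: {} bytes", pool.pooled(), peak_bytes(256, 5 * 1024));
}
```

With 256 slots of 5 KiB the worst case is 1,310,720 bytes, which matches the 1.25 MiB peak quoted for the dev server.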
## Bench rows added
The criterion writer-paths bench and the resource bench gain
`streaming+inject(opts)` and `streaming+inject(opts) POOLED` rows
that exercise the new API. The previous commits' baseline rows
remain so deltas are directly comparable across all three commits.
## Test coverage (12 new tests in this commit)
Handler:
- head_inject_emits_at_head_end_boundary
- body_inject_emits_at_body_end_boundary
- injects_are_no_op_when_unset
- empty_inject_string_treated_as_unset
- inject_html_is_passed_through_verbatim
- injects_robust_against_marker_literals_in_content
(proves the structural-signal approach cannot mis-fire on </body>
literals inside HTML comments - a class of bug the byte-scanner
approach was vulnerable to)
- both_injects_fire_at_correct_boundaries
- injects_dedupe_against_duplicate_signals (security guard)
- injects_no_op_when_no_head_or_body_signals (Shadow DOM safe)
- concurrent_renders_with_different_injects_do_not_cross_contaminate
(16-thread stress test of the &self handler)
- large_inject_roundtrips_without_truncation (1 MiB inject)
- empty_nonce_treated_as_unset (API consistency)
All 283 handler tests + 13 streaming tests pass.
## Documentation
DESIGN.md "Streaming Response Writers" section rewritten to document:
- the signal-based injection API
- the safety contract (raw HTML, no escaping, host owns trust)
- the dedup guarantee (max one emission per render)
- the zero-allocation borrow invariant
- the structural-signal correctness advantage over byte-scanning
User-facing docs and bench READMEs reference the new API.
Reproduction:
    # Capture baseline at the previous commit:
    git checkout HEAD^
    cargo xtask bench full --save-baseline before
    # Apply this commit and compare:
    git checkout HEAD
    cargo xtask bench full --baseline before
Quality: cargo xtask check passes (1111s, all phases).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed: 6e3af60 to 67ba22a
SSR pipeline with bounded-channel streaming, lock-free chunk pool, structural per-render HTML injection, and a zero-allocation hot path. Replaces the legacy buffer-then-byte-scan-and-concat pipeline.
Headline numbers
Per-render, contact-book @ 1000 contacts, allocator-exact + getrusage. Origin/main `string+postinject` → this PR `streaming+inject POOLED`.
Browser-perceived metrics (Playwright, real Chromium, 250 ms render).
Three commits
- 86ca1cbd: … (`string`, `string+postinject`). Cherry-pickable onto origin/main for a baseline.
- 7767be48: `StreamingWriter` + `ChunkPool` primitive + 3 new bench layers (criterion writer-paths, e2e-ttfb actix, Playwright browser).
- 6e3af609: `RenderOptions::with_head_inject` / `with_body_inject`, dedup DoS guard, 6 hot-path allocation cuts, CLI / commerce wiring.
What's in each commit
Streaming primitive (commit 2): bounded tokio mpsc (default 4 chunks ≈ 16 KiB → backpressure), configurable `with_flush_timeout` (slow-loris DoS bound), lock-free `ChunkPool` via `crossbeam_queue::ArrayQueue` for zero per-flush allocation in steady state, typed `ClientDisconnected` / `StreamTimeout` errors.
Structural HTML injection (commit 3): `with_head_inject` / `with_body_inject` emit at parser-synthesized `head_end` / `body_end` structural boundaries. No byte scanner, so it cannot mis-fire on `</head>` / `</body>` literals in HTML comments, `<iframe srcdoc>`, or inline `<script>`. Dedupe guard prevents inject amplification (a 1 MiB inject × N duplicate signals would otherwise produce N MiB of output).
Hot-path perf (commit 3): `request_path` / `entry_id` / `nonce` / `route_base` switched from `String` to `&'a str` / `Cow<'a, str>` (4 fewer allocations per render); `<for>` loop variable name inserted once + `get_mut`-swapped per iteration (2·(N−1) allocations saved on N-iter loops); per-render `component_index_cache` eliminates the second `build_component_index` walk at `body_end`.
Security guards (commit 3): empty-string normalization at handler init (defends against `RenderOptions { nonce: Some(""), .. }` field-bypass which would emit `<script nonce="">`, a hard CSP failure); explicit XSS warning on inject builder doc comments; `encode_safe` re-exported from `webui_handler` for callers that need pre-escaping.
Tests
299 total (15 streaming + 284 handler), including: 16-thread concurrent-render stress, 1 MiB inject roundtrip, cross-thread `Bytes` drop, marker-spoof robustness (`<!-- </body> -->` literals), dedupe guard, field-bypass CSP, nested `<for>` reusing same variable name, runtime-free flush timeout positive test, `end()` surfaces first-flush error.
`cargo xtask check` ✅ green (1344 s).
Reproduce
Docs
- `BENCHMARKS.md`: bench layer reference + before/after workflow
- `DESIGN.md` "Streaming Response Writers": primitive + injection API + safety contracts
- `crates/webui/benches/README.md`, `examples/integration/streaming-browser-bench/README.md`
Production wiring
- `crates/webui-cli/src/commands/serve.rs`: dev server uses `StreamingWriter::new_pooled` (256-slot pool ≈ 1.25 MiB peak), 30 s flush deadline, livereload script as `Arc<str>` via `with_body_inject`.
- `examples/app/commerce/server`: same pattern; per-page image preload `<link>` via `with_head_inject`.